﻿WEBVTT

00:00:07.641 --> 00:00:10.308
- So welcome everyone to CS231n.

00:00:11.762 --> 00:00:14.235
I'm super excited to
offer this class again

00:00:14.235 --> 00:00:15.507
for the third time.

00:00:15.507 --> 00:00:17.568
It seems that every
time we offer this class

00:00:17.568 --> 00:00:21.523
it's growing exponentially
unlike most things in the world.

00:00:21.523 --> 00:00:24.434
This is the third time
we're teaching this class.

00:00:24.434 --> 00:00:26.466
The first time we had 150 students.

00:00:26.466 --> 00:00:29.000
Last year, we had 350
students, so it doubled.

00:00:29.000 --> 00:00:32.852
This year we've doubled
again to about 730 students

00:00:32.852 --> 00:00:34.806
when I checked this morning.

00:00:34.806 --> 00:00:38.428
So anyone who was not able
to fit into the lecture hall

00:00:38.428 --> 00:00:40.094
I apologize.

00:00:40.094 --> 00:00:43.189
But, the videos will be
up on the SCPD website

00:00:43.189 --> 00:00:44.931
within about two hours.

00:00:44.931 --> 00:00:46.900
So if you weren't able to come today,

00:00:46.900 --> 00:00:50.889
then you can still check it
out within a couple hours.

00:00:50.889 --> 00:00:55.076
So this class CS231n is
really about computer vision.

00:00:55.076 --> 00:00:57.412
And, what is computer vision?

00:00:57.412 --> 00:01:00.141
Computer vision is really
the study of visual data.

00:01:00.141 --> 00:01:02.578
Since there's so many people
enrolled in this class,

00:01:02.578 --> 00:01:04.522
I think I probably don't
need to convince you

00:01:04.522 --> 00:01:06.219
that this is an important problem,

00:01:06.219 --> 00:01:10.032
but I'm still going to
try to do that anyway.

00:01:10.032 --> 00:01:11.895
The amount of visual data in our world

00:01:11.895 --> 00:01:14.173
has really exploded to a ridiculous degree

00:01:14.173 --> 00:01:15.761
in the last couple of years.

00:01:15.761 --> 00:01:17.613
And, this is largely a
result of the large number

00:01:17.613 --> 00:01:20.398
of sensors in the world.

00:01:20.398 --> 00:01:21.759
Probably most of us in this room

00:01:21.759 --> 00:01:23.064
are carrying around smartphones,

00:01:23.064 --> 00:01:25.004
and each smartphone has one, two,

00:01:25.004 --> 00:01:26.989
or maybe even three cameras on it.

00:01:26.989 --> 00:01:28.974
So I think on average
there's even more cameras

00:01:28.974 --> 00:01:31.114
in the world than there are people.

00:01:31.114 --> 00:01:32.765
And, as a result of all of these sensors,

00:01:32.765 --> 00:01:35.371
there's just a crazy large, massive amount

00:01:35.371 --> 00:01:37.524
of visual data being produced
out there in the world

00:01:37.524 --> 00:01:38.508
each day.

00:01:38.508 --> 00:01:41.239
So one statistic that I
really like to kind of put

00:01:41.239 --> 00:01:43.858
this in perspective is a 2015 study

00:01:43.858 --> 00:01:47.025
from Cisco that estimated that by 2017

00:01:48.919 --> 00:01:51.784
which is where we are now that roughly 80%

00:01:51.784 --> 00:01:54.484
of all traffic on the
internet would be video.

00:01:54.484 --> 00:01:58.074
This is not even counting all the images

00:01:58.074 --> 00:02:00.525
and other types of visual data on the web.

00:02:00.525 --> 00:02:03.880
But, just from a pure
number of bits perspective,

00:02:03.880 --> 00:02:06.002
the majority of bits
flying around the internet

00:02:06.002 --> 00:02:07.476
are actually visual data.

00:02:07.476 --> 00:02:09.547
So it's really critical
that we develop algorithms

00:02:09.547 --> 00:02:13.157
that can utilize and understand this data.

00:02:13.157 --> 00:02:15.370
However, there's a
problem with visual data,

00:02:15.370 --> 00:02:17.813
and that's that it's
really hard to understand.

00:02:17.813 --> 00:02:20.813
Sometimes we call visual
data the dark matter

00:02:20.813 --> 00:02:24.526
of the internet in analogy
with dark matter in physics.

00:02:24.526 --> 00:02:27.437
So for those of you who have
heard of this in physics

00:02:27.437 --> 00:02:31.180
before, dark matter accounts
for some astonishingly large

00:02:31.180 --> 00:02:33.377
fraction of the mass in the universe,

00:02:33.377 --> 00:02:35.167
and we know about it due to the existence

00:02:35.167 --> 00:02:38.293
of gravitational pulls on
various celestial bodies

00:02:38.293 --> 00:02:40.535
and what not, but we
can't directly observe it.

00:02:40.535 --> 00:02:42.838
And, visual data on the
internet is much the same

00:02:42.838 --> 00:02:45.488
where it comprises the majority of bits

00:02:45.488 --> 00:02:49.164
flying around the internet,
but it's very difficult

00:02:49.164 --> 00:02:51.313
for algorithms to actually
go in and understand

00:02:51.313 --> 00:02:54.222
and see what exactly is
comprising all the visual data

00:02:54.222 --> 00:02:55.685
on the web.

00:02:55.685 --> 00:02:58.466
Another statistic that I
like is that of YouTube.

00:02:58.466 --> 00:03:02.309
So roughly every second of clock time

00:03:02.309 --> 00:03:05.303
that happens in the world,
there's something like five hours

00:03:05.303 --> 00:03:07.746
of video being uploaded to YouTube.

00:03:07.746 --> 00:03:09.305
So if we just sit here and count,

00:03:09.305 --> 00:03:12.805
one, two, three, now there's 15 more hours

00:03:13.929 --> 00:03:15.596
of video on YouTube.

00:03:17.076 --> 00:03:18.824
Google has a lot of
employees, but there's no way

00:03:18.824 --> 00:03:21.219
that they could ever
have an employee sit down

00:03:21.219 --> 00:03:24.146
and watch and understand
and annotate every video.

00:03:24.146 --> 00:03:26.856
So if they want to catalog and serve you

00:03:26.856 --> 00:03:29.361
relevant videos and maybe
monetize by putting ads

00:03:29.361 --> 00:03:32.057
on those videos, it's really
crucial that we develop

00:03:32.057 --> 00:03:34.803
technologies that can dive in
and automatically understand

00:03:34.803 --> 00:03:37.053
the content of visual data.

00:03:38.649 --> 00:03:41.379
So this field of computer vision is

00:03:41.379 --> 00:03:44.089
truly an interdisciplinary
field, and it touches

00:03:44.089 --> 00:03:45.864
on many different areas of science

00:03:45.864 --> 00:03:47.564
and engineering and technology.

00:03:47.564 --> 00:03:50.822
So obviously, computer vision's
the center of the universe,

00:03:50.822 --> 00:03:53.914
but sort of as a constellation of fields

00:03:53.914 --> 00:03:56.453
around computer vision, we
touch on areas like physics

00:03:56.453 --> 00:03:59.418
because we need to understand
optics and image formation

00:03:59.418 --> 00:04:01.784
and how images are
actually physically formed.

00:04:01.784 --> 00:04:03.995
We need to understand
biology and psychology

00:04:03.995 --> 00:04:07.879
to understand how animal
brains physically see

00:04:07.879 --> 00:04:09.894
and process visual information.

00:04:09.894 --> 00:04:12.045
We of course draw a lot
on computer science,

00:04:12.045 --> 00:04:14.305
mathematics, and engineering
as we actually strive

00:04:14.305 --> 00:04:16.954
to build computer systems that implement

00:04:16.954 --> 00:04:19.639
our computer vision algorithms.

00:04:19.640 --> 00:04:22.595
So a little bit more about
where I'm coming from

00:04:22.595 --> 00:04:24.985
and about where the teaching
staff of this course

00:04:24.985 --> 00:04:25.992
is coming from.

00:04:25.992 --> 00:04:30.722
Me and my co-instructor
Serena are both PhD students

00:04:30.722 --> 00:04:33.606
in the Stanford Vision Lab which is headed

00:04:33.606 --> 00:04:37.184
by professor Fei-Fei Li,
and our lab really focuses

00:04:37.184 --> 00:04:39.940
on machine learning and
the computer science side

00:04:39.940 --> 00:04:41.184
of things.

00:04:41.184 --> 00:04:43.308
I work a little bit more
on language and vision.

00:04:43.308 --> 00:04:44.900
I've done some projects in that.

00:04:44.900 --> 00:04:46.658
And, other folks in our group have worked

00:04:46.658 --> 00:04:48.525
a little bit on the neuroscience
and cognitive science

00:04:48.525 --> 00:04:49.775
side of things.

00:04:52.541 --> 00:04:54.404
So as a bit of introduction,
you might be curious

00:04:54.404 --> 00:04:57.557
about how this course relates
to other courses at Stanford.

00:04:57.557 --> 00:05:01.408
So we kind of assume a basic
introductory understanding

00:05:01.408 --> 00:05:02.848
of computer vision.

00:05:02.848 --> 00:05:04.787
So if you're kind of an undergrad,

00:05:04.787 --> 00:05:06.926
and you've never seen
computer vision before,

00:05:06.926 --> 00:05:09.698
maybe you should've taken
CS131 which was offered

00:05:09.698 --> 00:05:14.229
earlier this year by Fei-Fei
and Juan Carlos Niebles.

00:05:14.229 --> 00:05:17.361
There was a course taught last quarter

00:05:17.361 --> 00:05:20.836
by Professor Chris
Manning and Richard Socher

00:05:20.836 --> 00:05:22.705
about the intersection of deep learning

00:05:22.705 --> 00:05:24.925
and natural language processing.

00:05:24.925 --> 00:05:27.512
And, I imagine a number of
you may have taken that course

00:05:27.512 --> 00:05:28.595
last quarter.

00:05:31.482 --> 00:05:33.785
There'll be some overlap
between this course and that,

00:05:33.785 --> 00:05:35.769
but we're really focusing
on the computer vision

00:05:35.769 --> 00:05:38.861
side of things, and really
focusing all of our motivation

00:05:38.861 --> 00:05:40.444
in computer vision.

00:05:41.361 --> 00:05:43.078
Also concurrently taught this quarter

00:05:43.078 --> 00:05:47.378
is CS231a taught by
Professor Silvio Savarese.

00:05:47.378 --> 00:05:52.306
And, CS231a
is a more all-encompassing

00:05:52.306 --> 00:05:54.010
computer vision course.

00:05:54.010 --> 00:05:57.569
It's focusing on things
like 3D reconstruction,

00:05:57.569 --> 00:05:59.896
on matching and robotic vision,

00:05:59.896 --> 00:06:01.412
and it's a bit more all encompassing

00:06:01.412 --> 00:06:03.813
with regards to vision than our course.

00:06:03.813 --> 00:06:06.647
And, this course, CS231n, really focuses

00:06:06.647 --> 00:06:09.358
on a particular class
of algorithms revolving

00:06:09.358 --> 00:06:11.922
around neural networks and
especially convolutional

00:06:11.922 --> 00:06:13.786
neural networks and their applications

00:06:13.786 --> 00:06:16.228
to various visual recognition tasks.

00:06:16.228 --> 00:06:17.725
Of course, there's also a number

00:06:17.725 --> 00:06:19.178
of seminar courses that are taught,

00:06:19.178 --> 00:06:21.154
and you'll have to check the syllabus

00:06:21.154 --> 00:06:24.631
and course schedule for
more details on those

00:06:24.631 --> 00:06:27.867
'cause they vary a bit each year.

00:06:27.867 --> 00:06:29.914
So this lecture is normally given

00:06:29.914 --> 00:06:31.672
by Professor Fei-Fei Li.

00:06:31.672 --> 00:06:34.174
Unfortunately, she wasn't
able to be here today,

00:06:34.174 --> 00:06:36.439
so instead for the majority of the lecture

00:06:36.439 --> 00:06:38.463
we're going to tag team a little bit.

00:06:38.463 --> 00:06:41.996
She actually recorded a
bit of pre-recorded audio

00:06:41.996 --> 00:06:44.772
describing to you the
history of computer vision

00:06:44.772 --> 00:06:48.229
because this class is a
computer vision course,

00:06:48.229 --> 00:06:50.456
and it's very critical and
important that you understand

00:06:50.456 --> 00:06:53.289
the history and the context
of all the existing work

00:06:53.289 --> 00:06:55.183
that led us to these developments

00:06:55.183 --> 00:06:58.000
of convolutional neural
networks as we know them today.

00:06:58.500 --> 00:07:00.000
I'll let virtual Fei-Fei take over

00:07:00.398 --> 00:07:01.915
[laughing]

00:07:01.915 --> 00:07:03.800
and give you a brief
introduction to the history

00:07:04.000 --> 00:07:05.500
of computer vision.

00:07:08.610 --> 00:07:15.309
Okay, let's start with today's agenda.
So we have two topics to cover: one is a

00:07:15.309 --> 00:07:20.620
brief history of computer vision and the
other one is the overview of our course

00:07:20.620 --> 00:07:28.539
CS231n. So we'll start with a very
brief history of where vision comes

00:07:28.540 --> 00:07:36.100
from, when computer vision started, and
where we are today. The

00:07:36.100 --> 00:07:44.770
history of vision goes back many, many
years, in fact about 543 million

00:07:44.770 --> 00:07:50.800
years ago. What was life like during that
time? Well, the earth was mostly water;

00:07:50.920 --> 00:07:58.300
there were a few species of animals
floating around in the ocean and life

00:07:58.300 --> 00:08:03.730
was very chill. Animals didn't move around
much; they didn't have eyes or

00:08:03.730 --> 00:08:09.640
anything. When food swam by, they grabbed
it; if the food didn't swim by, they

00:08:09.640 --> 00:08:17.140
just floated around. But something really
remarkable happened around 540 million

00:08:17.140 --> 00:08:25.509
years ago. From fossil studies, zoologists
found out that within a very short period of

00:08:25.509 --> 00:08:33.820
time, ten million years, the number of
animal species just exploded. It went

00:08:33.820 --> 00:08:41.500
from a few of them to hundreds of
thousands, and that was strange. What caused this?

00:08:41.500 --> 00:08:47.920
There were many theories but for many
years it was a mystery. Evolutionary

00:08:47.920 --> 00:08:55.540
biologists call this evolution's Big Bang.
A few years ago an Australian zoologist

00:08:55.540 --> 00:09:01.299
called Andrew Parker proposed one of the
most convincing theories. From the studies

00:09:01.299 --> 00:09:07.030
of fossils,
he discovered that around 540 million years

00:09:07.030 --> 00:09:19.310
ago the first animals developed eyes and
the onset of vision started this

00:09:19.310 --> 00:09:26.610
explosive speciation phase. Animals could
suddenly see; once you can see, life

00:09:26.610 --> 00:09:32.580
becomes much more proactive. Some
predators went after prey and prey

00:09:32.580 --> 00:09:39.980
had to escape from predators, so the
evolution or onset of vision started an

00:09:39.980 --> 00:09:46.860
evolutionary arms race and animals had
to evolve quickly in order to survive as

00:09:46.860 --> 00:09:54.870
a species. So that was the beginning of
vision in animals. After 540 million

00:09:54.870 --> 00:10:01.380
years vision has developed into the
biggest sensory system of almost all

00:10:01.380 --> 00:10:09.660
animals, especially intelligent animals.
In humans, we have almost 50% of the

00:10:09.660 --> 00:10:15.450
neurons in our cortex involved in visual
processing. It is the biggest sensory

00:10:15.450 --> 00:10:22.590
system that enables us to survive, work,
move around, manipulate things,

00:10:22.590 --> 00:10:29.730
communicate, entertain, and many other things.
Vision is really important for

00:10:29.730 --> 00:10:38.930
animals and especially intelligent
animals. So that was a quick story of

00:10:38.930 --> 00:10:48.329
biological vision. What about humans, the
history of humans making mechanical

00:10:48.329 --> 00:10:56.450
vision or cameras? Well one of the early
cameras that we know today is from the

00:10:56.450 --> 00:11:04.410
1600s, the Renaissance period:
the camera obscura. This is a camera

00:11:04.410 --> 00:11:13.730
based on pinhole camera theories. It's
very similar to the

00:11:13.730 --> 00:11:21.390
early eyes that animals developed,
with a hole that collects light

00:11:21.390 --> 00:11:28.020
and then a plane in the back of the
camera that collects the information and

00:11:28.020 --> 00:11:36.560
projects the imagery. So
as cameras evolved, today we have cameras

00:11:36.560 --> 00:11:40.910
everywhere this is one of the most
popular sensors people use from

00:11:40.910 --> 00:11:49.040
smartphones to other sensors. In the
meantime, biologists started

00:11:49.040 --> 00:11:56.510
studying the mechanism of vision. One of
the most influential works in both human

00:11:56.510 --> 00:12:02.690
and animal vision, and one
that inspired computer vision, is the

00:12:02.690 --> 00:12:10.850
work done by Hubel and Wiesel in the 50s
and 60s using electrophysiology.

00:12:10.850 --> 00:12:18.170
The question they were asking was "what is the visual processing mechanism like

00:12:18.170 --> 00:12:26.600
in primates, in mammals?" So they chose
to study the cat brain, which is more or less

00:12:26.600 --> 00:12:32.090
similar to the human brain from a visual
processing point of view. What they did

00:12:32.090 --> 00:12:37.490
is to stick some electrodes in the back
of the cat brain which is where the

00:12:37.490 --> 00:12:45.830
primary visual cortex area is and then
look at what stimuli make the neurons

00:12:45.830 --> 00:12:52.970
in the back, in the primary visual
cortex of the cat brain, respond excitedly.

00:12:52.970 --> 00:13:00.380
What they learned is that there are many
types of cells in the primary

00:13:00.380 --> 00:13:05.630
visual cortex part of the cat brain,
but one of the most important types is

00:13:05.630 --> 00:13:12.080
the simple cells; they respond to
oriented edges when they move in certain

00:13:12.080 --> 00:13:18.410
directions. Of course there are also more
complex cells but by and large what they

00:13:18.410 --> 00:13:26.060
discovered is visual processing starts
with simple structures of the visual world,

00:13:26.060 --> 00:13:32.210
oriented edges and as information
moves along the visual processing

00:13:32.210 --> 00:13:38.560
pathway the brain builds up the
complexity of the visual information

00:13:38.560 --> 00:13:46.280
until it can recognize the complex
visual world. So the history of

00:13:46.280 --> 00:13:55.070
computer vision also starts around the early
60s. Block World is a body of work

00:13:55.070 --> 00:14:00.410
published by Larry Roberts which is
widely known as one of the first,

00:14:00.410 --> 00:14:07.250
probably the first PhD thesis of
computer vision where the visual world

00:14:07.250 --> 00:14:13.850
was simplified into simple geometric
shapes and the goal is to be able to

00:14:13.850 --> 00:14:23.419
recognize them and reconstruct what
these shapes are. In 1966 there was a now

00:14:23.419 --> 00:14:31.550
famous MIT summer project called "The
Summer Vision Project." The goal of this

00:14:31.550 --> 00:14:38.440
Summer Vision Project, I read: "is an
attempt to use our summer workers

00:14:38.440 --> 00:14:44.240
effectively in the construction of a
significant part of a visual system."

00:14:44.240 --> 00:14:47.780
So the goal is, in one summer, we're gonna
work out

00:14:47.780 --> 00:14:54.590
the bulk of the visual system. That was
an ambitious goal. Fifty years have

00:14:54.590 --> 00:15:02.240
passed; the field of computer vision has
blossomed from one summer project into a

00:15:02.240 --> 00:15:07.610
field of thousands of researchers
worldwide still working on some of the

00:15:07.610 --> 00:15:13.940
most fundamental problems of vision. We
still have not yet solved vision but it

00:15:13.940 --> 00:15:21.380
has grown into one of the most important
and fastest growing areas

00:15:21.380 --> 00:15:27.410
of artificial intelligence. Another
person that we should pay tribute to is

00:15:27.410 --> 00:15:34.550
David Marr. David Marr was an MIT vision
scientist, and he wrote an

00:15:34.550 --> 00:15:41.510
influential book in the late 70s about
what he thinks vision is and how we

00:15:41.510 --> 00:15:48.200
should go about computer vision
and developing algorithms that can

00:15:48.200 --> 00:15:57.020
enable computers to recognize the visual
world. The thought process in his,

00:15:57.020 --> 00:16:02.440
in David Marr's book is
that in order to take an image and

00:16:02.440 --> 00:16:10.639
arrive at a final, holistic, full 3D
representation of the visual world we

00:16:10.640 --> 00:16:16.360
have to go through several processes. The
first process is what he calls the "primal sketch";

00:16:16.360 --> 00:16:23.060
this is where mostly the edges,
the bars, the ends, the virtual lines, the

00:16:23.060 --> 00:16:28.970
curves, the boundaries, are represented
and this is very much inspired by what

00:16:28.970 --> 00:16:34.639
neuroscientists have seen: Hubel and
Wiesel told us the early stage of visual

00:16:34.639 --> 00:16:41.420
processing has a lot to do with simple
structures like edges. Then the next step

00:16:41.420 --> 00:16:45.860
after the edges and the curves is what David Marr calls the

00:16:45.860 --> 00:16:52.300
"two-and-a-half-D sketch"; this is where we
start to piece together the surfaces,

00:16:52.300 --> 00:16:58.840
the depth information, the layers, or the
discontinuities of the visual scene,

00:16:58.850 --> 00:17:04.930
and then eventually we put everything
together and have a 3D model

00:17:04.930 --> 00:17:11.579
hierarchically organized in terms of
surface and volumetric primitives and so on.

00:17:11.579 --> 00:17:20.719
So that was a very idealized thought
process of what vision is and this way

00:17:20.720 --> 00:17:25.790
of thinking actually has dominated
computer vision for several decades and

00:17:25.790 --> 00:17:31.940
is also a very intuitive way for
students to enter the field of vision

00:17:31.940 --> 00:17:38.230
and think about how we can deconstruct
the visual information.

00:17:39.310 --> 00:17:48.380
Another very important seminal group of
work happened in the 70s where people

00:17:48.380 --> 00:17:55.160
began to ask the question "how can we
move beyond the simple block world and

00:17:55.160 --> 00:18:02.509
start recognizing or representing real
world objects?" Think about the 70s,

00:18:02.509 --> 00:18:07.910
it's the time that there's very little
data available; computers are extremely

00:18:07.910 --> 00:18:13.360
slow, PCs are not even around,
but computer scientists are starting to

00:18:13.360 --> 00:18:20.170
think about how we can recognize and
represent objects. So in Palo Alto

00:18:20.170 --> 00:18:26.649
both at Stanford as well as SRI, two
groups of scientists proposed

00:18:26.649 --> 00:18:32.740
similar ideas: one is called "generalized
cylinder," one is called "pictorial structure."

00:18:32.740 --> 00:18:40.060
The basic idea is that every
object is composed of simple geometric

00:18:40.060 --> 00:18:45.510
primitives; for example a person can be
pieced together by generalized

00:18:45.510 --> 00:18:51.339
cylindrical shapes or a person can be
pieced together from critical parts and

00:18:51.339 --> 00:18:56.079
the elastic distances between
these parts,

00:18:56.079 --> 00:19:03.880
so either representation is a way to
reduce the complex structure of the

00:19:03.880 --> 00:19:11.140
object into a collection of
simpler shapes and their geometric configuration.

00:19:11.140 --> 00:19:19.220
These works have been
influential for quite a few years

00:19:19.220 --> 00:19:27.630
and then in the 80s, here
is another example of thinking about how to

00:19:27.630 --> 00:19:33.699
reconstruct or recognize the visual
world from simple world structures, this

00:19:33.699 --> 00:19:43.440
work is by David Lowe, in which he tries to
recognize razors by constructing

00:19:43.440 --> 00:19:50.860
lines and edges, mostly
straight lines and their combination.

00:19:50.860 --> 00:20:01.140
So there was a lot of effort in trying to
think about what the tasks in computer

00:20:01.149 --> 00:20:10.410
vision were in the 60s, 70s, and 80s, and frankly
it was very hard to solve the problem of

00:20:10.410 --> 00:20:17.980
object recognition; everything I've shown
you so far are very audacious, ambitious

00:20:17.980 --> 00:20:24.160
attempts but they remain at the level of
toy examples

00:20:24.160 --> 00:20:30.819
or just a few examples. Not a lot of
progress had been made in terms of

00:20:30.819 --> 00:20:38.019
delivering something that can work in
the real world. So as people thought about what

00:20:38.019 --> 00:20:43.709
the problems in solving vision are, one
important question that came around is:

00:20:43.709 --> 00:20:50.200
if object recognition is too hard,
maybe we should first do object segmentation,

00:20:50.200 --> 00:20:58.760
that is the task of taking
an image and grouping the pixels into meaningful areas.

00:20:58.760 --> 00:21:03.880
We might not know that the
pixels that group together are called a person,

00:21:03.880 --> 00:21:10.140
but we can extract out all the
pixels that belong to the person from its background;

00:21:10.140 --> 00:21:15.339
that is called image
segmentation. So here's one very early

00:21:15.339 --> 00:21:21.759
seminal work by Jitendra Malik and his
student Jianbo Shi from Berkeley,

00:21:21.760 --> 00:21:29.880
using a graph theory algorithm for the
problem of image segmentation.

00:21:29.880 --> 00:21:39.600
Here's another problem that made some headway
ahead of many other problems in

00:21:39.610 --> 00:21:45.850
computer vision, which is face detection.
Faces are one of the most important objects

00:21:45.850 --> 00:21:51.779
to humans, probably the most important
objects to humans. Around the time of

00:21:51.779 --> 00:21:59.079
1999 to 2000 machine learning techniques,
especially statistical machine

00:21:59.079 --> 00:22:05.220
learning techniques started to gain
momentum. These are techniques such as

00:22:05.220 --> 00:22:11.620
support vector machines, boosting,
graphical models, including the first

00:22:11.620 --> 00:22:18.449
wave of neural networks. One particular
work that made a big contribution was

00:22:18.449 --> 00:22:24.939
using the AdaBoost algorithm to do
real-time face detection by Paul Viola

00:22:24.939 --> 00:22:31.779
and Michael Jones and there's a lot to
admire in this work. It was done in 2001

00:22:31.779 --> 00:22:36.730
when computer chips were still very, very
slow, but they were able to do face

00:22:36.730 --> 00:22:42.550
detection in images
in near-real-time, and after the

00:22:42.550 --> 00:22:50.800
publication of this paper, in five years'
time, in 2006, Fujifilm rolled out the first

00:22:50.800 --> 00:22:58.960
digital camera that has a real-time
face detector in the camera, so it

00:22:58.960 --> 00:23:05.960
was a very rapid transfer from basic
science research to real world application.

00:23:05.960 --> 00:23:13.920
So as a field we continue to
explore how we can do object recognition

00:23:13.930 --> 00:23:22.720
better. So one of the very influential
ways of thinking from the late 90s till the

00:23:22.720 --> 00:23:31.300
first 10 years of the 2000s is feature-based
object recognition and here is a seminal

00:23:31.300 --> 00:23:39.670
work by David Lowe called the SIFT feature.
The idea is that to match an entire object,

00:23:39.670 --> 00:23:44.860
for example here is a stop sign, to
another stop sign is very difficult

00:23:44.860 --> 00:23:51.060
because there might be all kinds of
changes due to camera angles, occlusion,

00:23:51.060 --> 00:23:57.210
viewpoint, lighting, and just the
intrinsic variation of the object itself

00:23:57.210 --> 00:24:04.680
but it's inspired by the observation that there
are some parts of the object,

00:24:04.680 --> 00:24:15.000
some features, that tend to remain diagnostic
and invariant to changes so the task of

00:24:15.010 --> 00:24:21.610
object recognition began with identifying
these critical features on the object

00:24:21.610 --> 00:24:28.569
and then matching the features to a similar
object; that's an easier task than pattern

00:24:28.569 --> 00:24:36.070
matching the entire object. So here is a
figure from his paper where it shows

00:24:36.070 --> 00:24:42.060
that a handful, several dozen SIFT
features from one stop sign are

00:24:42.060 --> 00:24:49.440
identified and matched to the SIFT
features of another stop sign.

00:24:51.130 --> 00:24:59.330
Using the same building block which is
features, diagnostic features in images,

00:24:59.330 --> 00:25:04.780
we as a field have made another step
forward and started recognizing

00:25:04.780 --> 00:25:12.320
holistic scenes. Here is an example
algorithm called Spatial Pyramid Matching;

00:25:12.320 --> 00:25:18.620
the idea is that there are
features in the images that can give us

00:25:18.620 --> 00:25:23.750
clues about which type of scene it is,
whether it's a landscape or a kitchen or

00:25:23.750 --> 00:25:31.580
a highway and so on and this particular
work takes these features from different

00:25:31.580 --> 00:25:37.130
parts of the image and in different
resolutions and puts them together in a

00:25:37.130 --> 00:25:44.780
feature descriptor, and then we run a
support vector machine algorithm on top of that.

00:25:44.780 --> 00:25:53.930
Similarly, very similar work
has gained momentum in human recognition

00:25:53.930 --> 00:26:02.990
so putting together these features well,
we have a number of works that look at

00:26:02.990 --> 00:26:10.490
how we can compose human bodies in more
realistic images and recognize them.

00:26:10.490 --> 00:26:15.710
So one work is called the "histogram of
gradients," another work is called

00:26:15.710 --> 00:26:26.770
"deformable part models," so as you
can see, as we move from the 60s, 70s, 80s

00:26:26.770 --> 00:26:34.160
towards the first decade of the 21st
century one thing is changing and that's

00:26:34.160 --> 00:26:40.700
the quality of the pictures. With
the Internet, the

00:26:40.700 --> 00:26:45.680
growth of the Internet, and digital
cameras, we're having better and better

00:26:45.680 --> 00:26:54.380
data to study computer vision. So one of
the outcomes in the early 2000s is that

00:26:54.380 --> 00:27:02.840
the field of computer vision has defined
a very important building block problem to solve.

00:27:02.840 --> 00:27:05.600
It's not the only problem to solve but

00:27:05.600 --> 00:27:11.120
in terms of recognition this is a very
important problem to solve which is

00:27:11.120 --> 00:27:18.950
object recognition. I talked about object
recognition all along but in the early

00:27:18.950 --> 00:27:26.600
2000s we began to have benchmark data
sets that can enable us to measure the

00:27:26.600 --> 00:27:32.930
progress of object recognition. One of
the most influential benchmark data sets

00:27:32.930 --> 00:27:41.480
is called PASCAL Visual Object Challenge,
and it's a data set composed of 20

00:27:41.480 --> 00:27:48.500
object classes, three of them are shown
here: train, airplane, person; I think it

00:27:48.500 --> 00:27:57.440
also has cows, bottles, cats, and so on; and
the data set is composed of several

00:27:57.440 --> 00:28:04.280
thousand to ten thousand images per
category, and then in the field different

00:28:04.280 --> 00:28:11.750
groups develop algorithms to test
against the testing set and see how we

00:28:11.750 --> 00:28:19.870
have made progress. So here is a figure
that shows from year 2007 to year 2012.

00:28:19.870 --> 00:28:31.100
The performance on detecting the
20 object classes in this

00:28:31.100 --> 00:28:38.680
benchmark data set has steadily
increased. So there was a lot of progress made.

00:28:38.680 --> 00:28:45.170
Around that time a group of us from
Princeton to Stanford also began to ask

00:28:45.170 --> 00:28:53.330
a harder question to ourselves as well
as our field which is: are we ready

00:28:53.330 --> 00:29:00.260
to recognize every object or most of the
objects in the world? It's also motivated

00:29:00.260 --> 00:29:07.970
by an observation that is rooted in
machine learning which is that most of

00:29:07.970 --> 00:29:12.410
the machine learning algorithms, it
doesn't matter if it's a graphical model,

00:29:12.410 --> 00:29:20.070
or support vector machine, or AdaBoost,
are very likely to overfit in

00:29:20.070 --> 00:29:25.410
the training process and part of the
problem is that visual data is very complex.

00:29:25.410 --> 00:29:32.700
Because it's complex, our models tend to
have a high dimension

00:29:32.700 --> 00:29:37.559
of input and have to have a lot of
parameters to fit and when we don't have

00:29:37.559 --> 00:29:44.160
enough training data overfitting happens
very fast and then we cannot generalize

00:29:44.160 --> 00:29:52.440
very well. So motivated by this dual
reason, one being that we just want to recognize the

00:29:52.440 --> 00:29:58.340
world of all the objects, the other
one is to

00:29:58.340 --> 00:30:04.620
overcome the machine learning
bottleneck of overfitting, we began this

00:30:04.620 --> 00:30:11.140
project called ImageNet. We wanted to
put together the largest possible dataset

00:30:11.140 --> 00:30:17.900
of all the pictures we can find, the
world of objects, and use that for

00:30:17.910 --> 00:30:23.250
training as well as for benchmarking. So
it was a project that took us about

00:30:23.250 --> 00:30:30.330
three years, lots of hard work; it
basically began with downloading

00:30:30.330 --> 00:30:37.620
billions of images from the internet
organized by a dictionary called

00:30:37.620 --> 00:30:45.770
WordNet which is tens of thousands of
object classes and then we have to use

00:30:45.770 --> 00:30:52.230
some clever crowd engineering tricks, a
method using the Amazon Mechanical Turk

00:30:52.230 --> 00:31:02.270
platform to sort, clean, label each of the
images. The end result is an ImageNet of

00:31:02.270 --> 00:31:10.830
almost 15 million, or 14 million plus,
images organized in twenty-two thousand

00:31:10.830 --> 00:31:20.880
categories of objects and scenes and
this is the gigantic, probably the

00:31:20.880 --> 00:31:29.289
biggest dataset produced in the field of
AI at that time and it began to push

00:31:29.289 --> 00:31:35.759
forward the algorithm development of
object recognition into another phase.

00:31:35.759 --> 00:31:41.200
Especially important is how to benchmark
the progress.

00:31:41.200 --> 00:31:49.419
So starting in 2009, the ImageNet team rolled
out an international challenge called

00:31:49.419 --> 00:31:57.309
ImageNet Large-Scale Visual Recognition
Challenge and for this challenge we put

00:31:57.309 --> 00:32:06.190
together a more stringent test set of
1.4 million objects across 1,000 object

00:32:06.190 --> 00:32:13.629
classes and this is to test the image
classification recognition results for

00:32:13.629 --> 00:32:21.989
the computer vision algorithms. So here's
the example picture and if an algorithm

00:32:21.989 --> 00:32:32.259
can output 5 labels and the top five
labels include the correct object in

00:32:32.259 --> 00:32:42.909
this picture then we call this a success.
So here is a result summary of the

00:32:42.909 --> 00:32:49.720
ImageNet Challenge, of the image
classification result from 2010

00:32:49.720 --> 00:33:00.740
to 2015. On the x axis you see the
years and the y axis you see the error rate.
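
NOTE
A minimal sketch of the top-five error rate plotted on that y axis, assuming each prediction is a list of labels ranked best guess first. The name top5_error and the toy data are hypothetical, not part of the ImageNet Challenge tooling.
```python
def top5_error(ranked_predictions, true_labels):
    """Fraction of images whose true label is missing from the top five guesses."""
    misses = 0
    for guesses, truth in zip(ranked_predictions, true_labels):
        # A prediction counts as a success if the truth is among the first five labels.
        if truth not in guesses[:5]:
            misses += 1
    return misses / len(true_labels)
# Hypothetical toy data: each inner list holds one algorithm's labels for one image.
predictions = [
    ["cat", "dog", "fox", "wolf", "lynx"],
    ["car", "bus", "van", "tram", "bike"],
]
labels = ["fox", "train"]
error = top5_error(predictions, labels)
```
Here the first image is a success ("fox" is among its five guesses) and the second is a miss, so half of the toy images count as errors.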

00:33:00.740 --> 00:33:06.820
So the good news is the error rate
is steadily decreasing, to the point that by

00:33:06.820 --> 00:33:15.369
2015 the error rate is so low it's on par
with what humans can do, and here by a human

00:33:15.369 --> 00:33:25.359
I mean a single Stanford PhD student who
spent weeks doing this task as if

00:33:25.359 --> 00:33:32.470
he were a computer participating in the
ImageNet Challenge. So that's a lot of

00:33:32.470 --> 00:33:39.669
progress made even though we have not
solved all the problems of object

00:33:39.669 --> 00:33:43.110
recognition which you'll learn about in
this class

00:33:43.110 --> 00:33:50.490
but to go from an error rate that's
unacceptable for real-world application

00:33:50.490 --> 00:33:56.400
all the way to being on par with
humans in the ImageNet Challenge, the field

00:33:56.400 --> 00:34:05.640
took only a few years. And one particular
moment you should notice on this graph

00:34:05.640 --> 00:34:15.719
is the year 2012. In the first two
years our error rate hovered around 25

00:34:15.719 --> 00:34:25.649
percent, but in 2012 the error rate
dropped almost 10 percentage points, to 16

00:34:25.650 --> 00:34:32.969
percent. Even though it's better now,
that drop was very significant, and the

00:34:32.969 --> 00:34:42.569
winning algorithm of that year is a
convolutional neural network model that

00:34:42.570 --> 00:34:49.850
beat all other algorithms around that
time to win the ImageNet challenge and

00:34:49.850 --> 00:34:58.200
this is the focus of our whole course
this quarter: to have a

00:34:58.200 --> 00:35:05.700
deep dive into what convolutional neural
network models are and another name for

00:35:05.700 --> 00:35:10.370
this is deep learning. By its

00:35:10.520 --> 00:35:15.330
popular name now it's called deep
learning and to look at what these

00:35:15.330 --> 00:35:20.429
models are, what the principles are, what
the good practices are, and what the

00:35:20.429 --> 00:35:26.400
recent progress on these models is. But
here is where history was made:

00:35:26.400 --> 00:35:33.000
around 2012, convolutional
neural network models or deep learning

00:35:33.000 --> 00:35:41.309
models showed tremendous capacity
and ability to make good progress in

00:35:41.309 --> 00:35:47.370
the field of computer vision along with
several other sister fields like natural

00:35:47.370 --> 00:35:51.900
language processing and speech
recognition. So without further ado I'm

00:35:51.900 --> 00:36:00.630
going to hand the rest of the lecture to
Justin to talk about the overview of

00:36:00.630 --> 00:36:02.500
CS 231n.

00:36:03.000 --> 00:36:04.763
Alright, thanks so much Fei-Fei.

00:36:05.000 --> 00:36:08.158
I'll take it over from here.

00:36:08.189 --> 00:36:09.910
So now I want to shift gears a little bit

00:36:09.910 --> 00:36:14.077
and talk a little bit more
about this class CS231n.

00:36:15.436 --> 00:36:18.636
So this class focuses
on one of these most,

00:36:18.636 --> 00:36:20.814
so the primary focus of this class

00:36:20.814 --> 00:36:22.950
is this image classification problem

00:36:22.950 --> 00:36:25.269
which we previewed a
little bit in the context

00:36:25.269 --> 00:36:27.037
of the ImageNet Challenge.

00:36:27.037 --> 00:36:28.848
So in image classification, again,

00:36:28.848 --> 00:36:31.470
the setup is that your
algorithm looks at an image

00:36:31.470 --> 00:36:34.048
and then picks from among
some fixed set of categories

00:36:34.048 --> 00:36:36.443
to classify that image.
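
NOTE
A minimal sketch of that setup, assuming some model has already turned the image into one score per category. The category list, the classify function, and the scores are hypothetical and only illustrate "pick one label from a fixed set".
```python
CATEGORIES = ["cat", "dog", "horse"]  # hypothetical fixed label set
def classify(scores):
    """Return the category whose score is highest; scores align with CATEGORIES."""
    if len(scores) != len(CATEGORIES):
        raise ValueError("expected one score per category")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return CATEGORIES[best_index]
# In a real system the scores would come from a model that looked at the pixels.
label = classify([0.1, 0.7, 0.2])
```
The output is always one of the predefined categories, which is what makes the setup look restrictive.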

00:36:36.443 --> 00:36:39.550
And, this might seem like
somewhat of a restrictive

00:36:39.550 --> 00:36:42.506
or artificial setup, but
it's actually quite general.

00:36:42.506 --> 00:36:45.521
And, this problem can be applied
in many different settings

00:36:45.521 --> 00:36:49.630
both in industry and academia
and many different places.

00:36:49.630 --> 00:36:52.957
So for example, you could
apply this to recognizing food

00:36:52.957 --> 00:36:54.906
or recognizing calories
in food or recognizing

00:36:54.906 --> 00:36:58.043
different artworks, different
products out in the world.

00:36:58.043 --> 00:37:01.576
So this relatively basic
tool of image classification

00:37:01.576 --> 00:37:04.272
is super useful on its
own and could be applied

00:37:04.272 --> 00:37:08.503
all over the place for many
different applications.

00:37:08.503 --> 00:37:10.685
But, in this course,
we're also going to talk

00:37:10.685 --> 00:37:13.806
about several other visual
recognition problems

00:37:13.806 --> 00:37:16.673
that build upon many of
the tools that we develop

00:37:16.673 --> 00:37:19.660
for the purpose of image classification.

00:37:19.660 --> 00:37:21.266
We'll talk about other problems

00:37:21.266 --> 00:37:24.783
such as object detection
or image captioning.

00:37:24.783 --> 00:37:26.665
So the setup in object detection

00:37:26.665 --> 00:37:28.435
is a little bit different.

00:37:28.435 --> 00:37:30.709
Rather than classifying an entire image

00:37:30.709 --> 00:37:33.727
as a cat or a dog or a horse or whatnot,

00:37:33.727 --> 00:37:35.851
instead we want to go in
and draw bounding boxes

00:37:35.851 --> 00:37:38.461
and say that there is a
dog here, and a cat here,

00:37:38.461 --> 00:37:40.351
and a car over in the background,

00:37:40.351 --> 00:37:42.186
and draw these boxes describing

00:37:42.186 --> 00:37:44.110
where objects are in the image.
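
NOTE
A minimal sketch of what a detector's output might look like, assuming axis-aligned boxes in pixel coordinates. The Detection class, its field names, and the numbers are hypothetical.
```python
from dataclasses import dataclass
@dataclass
class Detection:
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    def area(self) -> int:
        """Box area in pixels."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)
# Hypothetical output for one image: one labeled box per object found.
detections = [
    Detection("dog", 40, 120, 200, 300),
    Detection("cat", 220, 150, 330, 290),
    Detection("car", 350, 20, 480, 90),
]
labels = [d.label for d in detections]
```
Unlike classification, the result is a variable-length list, with a location attached to every label.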

00:37:44.110 --> 00:37:46.322
We'll also talk about image captioning

00:37:46.322 --> 00:37:47.745
where given an image the system

00:37:47.745 --> 00:37:50.111
now needs to produce a
natural language sentence

00:37:50.111 --> 00:37:51.475
describing the image.

00:37:51.475 --> 00:37:53.691
It sounds like a really hard, complicated,

00:37:53.691 --> 00:37:55.599
and different problem, but we'll see

00:37:55.599 --> 00:37:57.219
that many of the tools that we develop

00:37:57.219 --> 00:37:58.963
in service of image classification

00:37:58.963 --> 00:38:02.880
will be reused in these
other problems as well.

00:38:06.482 --> 00:38:08.451
So we mentioned this before in the context

00:38:08.451 --> 00:38:11.245
of the ImageNet Challenge,
but one of the things

00:38:11.245 --> 00:38:12.966
that's really driven the
progress of the field

00:38:12.966 --> 00:38:14.398
in recent years has been this adoption

00:38:14.398 --> 00:38:17.933
of convolutional neural networks or CNNs

00:38:17.933 --> 00:38:20.350
or sometimes called convnets.

00:38:20.350 --> 00:38:24.008
So if we look at the
algorithms that have won

00:38:24.008 --> 00:38:26.827
the ImageNet Challenge for
the last several years,

00:38:26.827 --> 00:38:30.479
in 2011 we see this method from Lin et al

00:38:30.479 --> 00:38:32.631
which is still hierarchical.

00:38:32.631 --> 00:38:34.860
It consists of multiple layers.

00:38:34.860 --> 00:38:36.769
So first we compute some features,

00:38:36.769 --> 00:38:38.742
next we compute some local invariances,

00:38:38.742 --> 00:38:41.211
some pooling, and go
through several layers

00:38:41.211 --> 00:38:42.939
of processing, and then finally feed

00:38:42.939 --> 00:38:46.276
this resulting descriptor to a linear SVM.

00:38:46.276 --> 00:38:49.230
What you'll notice here is that
this is still hierarchical.

00:38:49.230 --> 00:38:50.553
We're still detecting edges.

00:38:50.553 --> 00:38:52.583
We're still having notions of invariance.

00:38:52.583 --> 00:38:54.411
And, many of these
intuitions will carry over

00:38:54.411 --> 00:38:56.177
into convnets.

00:38:56.177 --> 00:38:59.115
But, the breakthrough
moment was really in 2012

00:38:59.115 --> 00:39:02.032
when Geoff Hinton's group in Toronto

00:39:03.693 --> 00:39:07.066
together with Alex
Krizhevsky and Ilya Sutskever

00:39:07.066 --> 00:39:09.225
who were his PhD students at that time

00:39:09.225 --> 00:39:12.504
created this seven layer
convolutional neural network

00:39:12.504 --> 00:39:15.212
now known as AlexNet,
then called SuperVision,

00:39:15.212 --> 00:39:18.169
which just did very, very well
in the ImageNet competition

00:39:18.169 --> 00:39:19.651
in 2012.

00:39:19.651 --> 00:39:22.484
And, since then every year
the winner of ImageNet

00:39:22.484 --> 00:39:24.197
has been a neural network.

00:39:24.197 --> 00:39:25.911
And, the trend has been
that these networks

00:39:25.911 --> 00:39:28.096
are getting deeper and deeper each year.

00:39:28.096 --> 00:39:31.561
So AlexNet was a seven or
eight layer neural network

00:39:31.561 --> 00:39:33.592
depending on how exactly you count things.

00:39:33.592 --> 00:39:35.561
In 2014 we had these much deeper networks.

00:39:35.561 --> 00:39:39.518
GoogLeNet from Google
and VGG, the VGG network

00:39:39.518 --> 00:39:43.172
from Oxford which was about
19 layers at that time.

00:39:43.172 --> 00:39:44.971
And, then in 2015 it got really crazy

00:39:44.971 --> 00:39:48.598
and this paper came out
from Microsoft Research Asia

00:39:48.598 --> 00:39:52.373
called Residual Networks which
were 152 layers at that time.

00:39:52.373 --> 00:39:55.037
And, since then it turns out you can get

00:39:55.037 --> 00:39:56.745
a little bit better if you go up to 200,

00:39:56.745 --> 00:39:58.505
but you run out of memory on your GPUs.

00:39:58.505 --> 00:40:00.352
We'll get into all of that later,

00:40:00.352 --> 00:40:03.096
but the main takeaway here
is that convolutional neural

00:40:03.096 --> 00:40:04.824
networks really had
this breakthrough moment

00:40:04.824 --> 00:40:06.825
in 2012, and since then there's been

00:40:06.825 --> 00:40:08.783
a lot of effort focused
in tuning and tweaking

00:40:08.783 --> 00:40:11.340
these algorithms to make them
perform better and better

00:40:11.340 --> 00:40:13.479
on this problem of image classification.

00:40:13.479 --> 00:40:15.479
And, throughout the rest of the quarter,

00:40:15.479 --> 00:40:17.100
we're going to really dive in deep,

00:40:17.100 --> 00:40:19.116
and you'll understand exactly
how these different models

00:40:19.116 --> 00:40:19.949
work.

00:40:22.514 --> 00:40:24.665
But, one point that's really important,

00:40:24.665 --> 00:40:27.348
it's true that the breakthrough moment

00:40:27.348 --> 00:40:30.260
for convolutional neural
networks was in 2012

00:40:30.260 --> 00:40:32.394
when these networks performed very well

00:40:32.394 --> 00:40:34.822
on the ImageNet Challenge,
but they certainly weren't

00:40:34.822 --> 00:40:36.551
invented in 2012.

00:40:36.551 --> 00:40:38.186
These algorithms had actually been around

00:40:38.186 --> 00:40:40.310
for quite a long time before that.

00:40:40.310 --> 00:40:43.796
So one of the sort of foundational works

00:40:43.796 --> 00:40:46.157
in this area of
convolutional neural networks

00:40:46.157 --> 00:40:50.450
was actually in the '90s from
Yann LeCun and collaborators

00:40:50.450 --> 00:40:53.633
who at that time were at Bell Labs.

00:40:53.633 --> 00:40:57.332
So in 1998 they built this
convolutional neural network

00:40:57.332 --> 00:40:58.829
for recognizing digits.

00:40:58.829 --> 00:41:02.591
They wanted to deploy
this and wanted to be able

00:41:02.591 --> 00:41:04.668
to automatically recognize
handwritten checks

00:41:04.668 --> 00:41:07.366
or addresses for the post office.

00:41:07.366 --> 00:41:09.384
And, they built this
convolutional neural network

00:41:09.384 --> 00:41:11.658
which could take in the pixels of an image

00:41:11.658 --> 00:41:14.582
and then classify either what digit it was

00:41:14.582 --> 00:41:17.237
or what letter it was or whatnot.

00:41:17.237 --> 00:41:19.206
And, the structure of this network

00:41:19.206 --> 00:41:21.206
actually looks pretty
similar to the AlexNet

00:41:21.206 --> 00:41:23.618
architecture that was used in 2012.

00:41:23.618 --> 00:41:25.449
Here we see that, you know, we're taking

00:41:25.449 --> 00:41:26.678
in these raw pixels.

00:41:26.678 --> 00:41:29.080
We have many layers of
convolution and sub-sampling,

00:41:29.080 --> 00:41:31.398
together with the so called
fully connected layers.

00:41:31.398 --> 00:41:33.395
All of which will be
explained in much more detail

00:41:33.395 --> 00:41:34.714
later in the course.
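
NOTE
A rough sketch of the three kinds of layers just named, in plain Python and under simplifying assumptions: one channel, one small kernel, 2x2 averaging as the sub-sampling step, and no nonlinearities or learned weights. It is not the 1998 network itself, and every name and number here is hypothetical.
```python
def convolve(image, kernel):
    """Valid 2-D convolution (cross-correlation, as is conventional in convnets)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]
def subsample(feature_map):
    """Average each non-overlapping 2x2 block, halving height and width."""
    return [[(feature_map[r][c] + feature_map[r][c + 1]
              + feature_map[r + 1][c] + feature_map[r + 1][c + 1]) / 4
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]
def fully_connected(values, weights):
    """One score per row of weights: a dot product with the flattened input."""
    return [sum(v * w for v, w in zip(values, row)) for row in weights]
# Hypothetical 5x5 image with a vertical edge, and a 2x2 edge-detecting kernel.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
features = subsample(convolve(image, kernel))  # sizes: 5x5, then 4x4, then 2x2
flat = [v for row in features for v in row]
scores = fully_connected(flat, [[1, 1, 1, 1], [0, 1, 0, 1]])
```
The raw pixels go in at the top, each stage shrinks the spatial grid, and the final layer maps what is left to one score per class.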

00:41:34.714 --> 00:41:36.716
But, if you just kind of
look at these two pictures,

00:41:36.716 --> 00:41:38.397
they look pretty similar.

00:41:38.397 --> 00:41:41.730
And, this architecture in 2012 has a lot

00:41:42.609 --> 00:41:44.449
of these architectural similarities

00:41:44.449 --> 00:41:49.299
that are shared with this
network going back to the '90s.

00:41:49.299 --> 00:41:50.816
So then the question you might ask

00:41:50.816 --> 00:41:53.377
is if these algorithms
were around since the '90s,

00:41:53.377 --> 00:41:55.815
why have they only suddenly become popular

00:41:55.815 --> 00:41:57.454
in the last couple of years?

00:41:57.454 --> 00:41:59.303
And, there's a couple
really key innovations

00:41:59.303 --> 00:42:03.277
that happened that have
changed since the '90s.

00:42:03.277 --> 00:42:05.351
One is computation.

00:42:05.351 --> 00:42:07.021
Thanks to Moore's law, we've gotten

00:42:07.021 --> 00:42:09.217
faster and faster computers every year.

00:42:09.217 --> 00:42:11.233
And, this is kind of a coarse measure,

00:42:11.233 --> 00:42:13.234
but if you just look at
the number of transistors

00:42:13.234 --> 00:42:15.129
that are on chips, then that has grown

00:42:15.129 --> 00:42:18.574
by several orders of magnitude
between the '90s and today.

00:42:18.574 --> 00:42:23.043
We've also had this advent
of graphics processing units

00:42:23.043 --> 00:42:25.878
or GPUs which are super parallelizable

00:42:25.878 --> 00:42:28.105
and ended up being a perfect tool

00:42:28.105 --> 00:42:30.866
for really crunching these
computationally intensive

00:42:30.866 --> 00:42:33.032
convolutional neural network models.

00:42:33.032 --> 00:42:35.941
So just by having more compute available,

00:42:35.941 --> 00:42:39.724
it allowed researchers to
explore with larger architectures

00:42:39.724 --> 00:42:42.150
and larger models, and in some cases,

00:42:42.150 --> 00:42:44.126
just increasing the model
size, but still using

00:42:44.126 --> 00:42:46.838
these kind of classical approaches
and classical algorithms

00:42:46.838 --> 00:42:48.476
tends to work quite well.

00:42:48.476 --> 00:42:51.415
So this idea of increasing computation

00:42:51.415 --> 00:42:55.554
is super important in the
history of deep learning.

00:42:55.554 --> 00:42:58.647
I think the second key
innovation that changed

00:42:58.647 --> 00:43:00.559
between now and the '90s was data.

00:43:00.559 --> 00:43:04.258
So these algorithms are
very hungry for data.

00:43:04.258 --> 00:43:06.319
You need to feed them
a lot of labeled images

00:43:06.319 --> 00:43:09.395
and labeled pixels for them
to eventually work quite well.

00:43:09.395 --> 00:43:11.653
And, in the '90s there just wasn't

00:43:11.653 --> 00:43:14.141
that much labeled data available.

00:43:14.141 --> 00:43:17.489
This was, again, before
tools like Mechanical Turk,

00:43:17.489 --> 00:43:20.232
before the internet was
super, super widely used.

00:43:20.232 --> 00:43:21.871
And, it was very difficult to collect

00:43:21.871 --> 00:43:23.614
large, varied datasets.

00:43:23.614 --> 00:43:27.531
But, now in the 2010s
with datasets like PASCAL

00:43:28.583 --> 00:43:31.633
and ImageNet, there existed
these relatively large,

00:43:31.633 --> 00:43:34.228
high quality labeled
datasets that were, again,

00:43:34.228 --> 00:43:36.590
orders and orders of magnitude bigger

00:43:36.590 --> 00:43:38.775
than the datasets available in the '90s.

00:43:38.775 --> 00:43:40.622
And, these much larger datasets, again,

00:43:40.622 --> 00:43:43.153
allowed us to work with
higher capacity models

00:43:43.153 --> 00:43:45.261
and train these models to
actually work quite well

00:43:45.261 --> 00:43:47.157
on real world problems.

00:43:47.157 --> 00:43:49.262
But, the critical takeaway here is

00:43:49.262 --> 00:43:51.023
that convolutional neural networks

00:43:51.023 --> 00:43:54.159
although they seem like this
sort of fancy, new thing

00:43:54.159 --> 00:43:56.117
that's only popped up in
the last couple of years,

00:43:56.117 --> 00:43:57.527
that's really not the case.

00:43:57.527 --> 00:43:59.583
And, this class of
algorithms have existed

00:43:59.583 --> 00:44:03.666
for quite a long time in
their own right as well.

00:44:05.015 --> 00:44:07.915
Another thing I'd like to point out

00:44:07.915 --> 00:44:09.724
in computer vision we're in the business

00:44:09.724 --> 00:44:12.755
of trying to build machines
that can see like people.

00:44:12.755 --> 00:44:15.257
And, people can actually
do a lot of amazing things

00:44:15.257 --> 00:44:16.650
with their visual systems.

00:44:16.650 --> 00:44:18.498
When you go around the world,

00:44:18.498 --> 00:44:21.034
you do a lot more than just drawing boxes

00:44:21.034 --> 00:44:24.988
around the objects and classifying
things as cats or dogs.

00:44:24.988 --> 00:44:27.711
Your visual system is much
more powerful than that.

00:44:27.711 --> 00:44:29.415
And, as we move forward in the field,

00:44:29.415 --> 00:44:31.612
I think there's still a
ton of open challenges

00:44:31.612 --> 00:44:34.047
and open problems that we need to address.

00:44:34.047 --> 00:44:36.630
And, we need to continue
to develop our algorithms

00:44:36.630 --> 00:44:40.220
to do even better and tackle
even more ambitious problems.

00:44:40.220 --> 00:44:42.964
Some examples of this are
going back to these older ideas

00:44:42.964 --> 00:44:44.043
in fact.

00:44:44.043 --> 00:44:46.923
Things like semantic segmentation
or perceptual grouping

00:44:46.923 --> 00:44:49.292
where rather than
labeling the entire image,

00:44:49.292 --> 00:44:51.969
we want to understand for
every pixel in the image

00:44:51.969 --> 00:44:53.866
what is it doing, what does it mean.

00:44:53.866 --> 00:44:55.661
And, we'll revisit that
idea a little bit later

00:44:55.661 --> 00:44:56.846
in the course.

00:44:56.846 --> 00:44:58.453
There's definitely work going back

00:44:58.453 --> 00:45:00.134
to this idea of 3D understanding,

00:45:00.134 --> 00:45:02.377
of reconstructing the entire world,

00:45:02.377 --> 00:45:06.127
and that's still an
unsolved problem I think.

00:45:07.498 --> 00:45:09.010
There're just tons and tons of other tasks

00:45:09.010 --> 00:45:10.178
that you can imagine.

00:45:10.178 --> 00:45:11.817
For example activity recognition,

00:45:11.817 --> 00:45:13.438
if I'm given a video of some person

00:45:13.438 --> 00:45:15.212
doing some activity, what's the best way

00:45:15.212 --> 00:45:16.725
to recognize that activity?

00:45:16.725 --> 00:45:19.469
That's quite a challenging
problem as well.

00:45:19.469 --> 00:45:21.286
And, then as we move forward with things

00:45:21.286 --> 00:45:23.274
like augmented reality
and virtual reality,

00:45:23.274 --> 00:45:25.332
and as new technologies
and new types of sensors

00:45:25.332 --> 00:45:27.578
become available, I think we'll come up

00:45:27.578 --> 00:45:29.955
with a lot of new, interesting
hard and challenging

00:45:29.955 --> 00:45:32.455
problems to tackle as a field.

00:45:33.916 --> 00:45:37.924
So this is an example
from some of my own work

00:45:37.924 --> 00:45:42.228
in the vision lab on this
dataset called Visual Genome.

00:45:42.228 --> 00:45:45.426
So here the idea is that
we're trying to capture

00:45:45.426 --> 00:45:47.474
some of these intricacies
in the real world.

00:45:47.474 --> 00:45:49.793
Rather than maybe describing just boxes,

00:45:49.793 --> 00:45:52.308
maybe we should be describing images

00:45:52.308 --> 00:45:55.056
as these whole large graphs
of semantically related

00:45:55.056 --> 00:45:57.525
concepts that encompass
not just object identities

00:45:57.525 --> 00:46:00.451
but also object relationships,
object attributes,

00:46:00.451 --> 00:46:02.590
actions that are occurring in the scene,

00:46:02.590 --> 00:46:06.971
and this type of
representation might allow us

00:46:06.971 --> 00:46:09.527
to capture some of this
richness of the visual world

00:46:09.527 --> 00:46:11.225
that's left on the table when we're using

00:46:11.225 --> 00:46:12.889
simple classification.

00:46:12.889 --> 00:46:15.270
This is by no means a standard
approach at this point,

00:46:15.270 --> 00:46:17.330
but just kind of giving you this sense

00:46:17.330 --> 00:46:19.635
that there's so much more
that your visual system can do

00:46:19.635 --> 00:46:22.590
that is maybe not captured in this vanilla

00:46:22.590 --> 00:46:24.840
image classification setup.

00:46:28.003 --> 00:46:29.744
I think another really interesting work

00:46:29.744 --> 00:46:31.592
that kind of points in this direction

00:46:31.592 --> 00:46:34.145
actually comes from
Fei-Fei's grad school days

00:46:34.145 --> 00:46:36.843
when she was doing her PhD at Caltech

00:46:36.843 --> 00:46:38.952
with her advisors there.

00:46:38.952 --> 00:46:41.692
In this setup, they had
people, they stuck people,

00:46:41.692 --> 00:46:44.604
and they showed people this
image for just half a second.

00:46:44.604 --> 00:46:46.302
So they flashed this
image in front of them

00:46:46.302 --> 00:46:47.896
for just a very short period of time,

00:46:47.896 --> 00:46:50.169
and even in this very, very rapid exposure

00:46:50.169 --> 00:46:52.108
to an image, people were able to write

00:46:52.108 --> 00:46:54.033
these long descriptive paragraphs

00:46:54.033 --> 00:46:56.473
giving a whole story of the image.

00:46:56.473 --> 00:47:00.284
And, this is quite remarkable
if you think about it

00:47:00.284 --> 00:47:03.692
that after just half a second
of looking at this image,

00:47:03.692 --> 00:47:05.560
a person was able to say that this is

00:47:05.560 --> 00:47:08.481
some kind of a game or
fight, two groups of men.

00:47:08.481 --> 00:47:10.375
The man on the left is throwing something.

00:47:10.375 --> 00:47:13.134
Outdoors because it seem like
I have an impression of grass,

00:47:13.134 --> 00:47:14.576
and so on and so on.

00:47:14.576 --> 00:47:16.016
And, you can imagine that if a person

00:47:16.016 --> 00:47:17.617
were to look even longer at this image,

00:47:17.617 --> 00:47:19.169
they could write probably a whole novel

00:47:19.169 --> 00:47:20.942
about who these people
are, and why are they

00:47:20.942 --> 00:47:22.307
in this field playing this game.

00:47:22.307 --> 00:47:23.685
They could go on and on and on

00:47:23.685 --> 00:47:25.613
roping in things from
their external knowledge

00:47:25.613 --> 00:47:27.187
and their prior experience.

00:47:27.187 --> 00:47:30.297
This is in some sense the
holy grail of computer vision.

00:47:30.297 --> 00:47:32.659
To sort of understand
the story of an image

00:47:32.659 --> 00:47:34.663
in a very rich and deep way.

00:47:34.663 --> 00:47:36.932
And, I think that despite
the massive progress

00:47:36.932 --> 00:47:39.706
in the field that we've had
over the past several years,

00:47:39.706 --> 00:47:44.460
we're still quite a long way
from achieving this holy grail.

00:47:44.460 --> 00:47:46.563
Another image that I
think really exemplifies

00:47:46.563 --> 00:47:50.472
this idea actually comes, again,
from Andrej Karpathy's blog

00:47:50.472 --> 00:47:52.890
is this amazing image.

00:47:52.890 --> 00:47:54.391
Many of you smiled, many of you laughed.

00:47:54.391 --> 00:47:56.212
I think this is a pretty funny image.

00:47:56.212 --> 00:47:57.696
But, why is it a funny image?

00:47:57.696 --> 00:47:59.895
Well we've got a man standing on a scale,

00:47:59.895 --> 00:48:01.607
and we know that people
are kind of self conscious

00:48:01.607 --> 00:48:04.380
about their weight sometimes,
and scales measure weight.

00:48:04.380 --> 00:48:06.899
Then we've got this other guy behind him

00:48:06.899 --> 00:48:08.791
pushing his foot down on the scale,

00:48:08.791 --> 00:48:10.900
and we know that because
of the way scales work

00:48:10.900 --> 00:48:12.958
that will cause him to
have an inflated reading

00:48:12.958 --> 00:48:13.867
on the scale.

00:48:13.867 --> 00:48:14.895
But, there's more.

00:48:14.895 --> 00:48:16.819
We know that this person
is not just any person.

00:48:16.819 --> 00:48:19.500
This is actually Barack
Obama who was at the time

00:48:19.500 --> 00:48:20.905
President of the United States,

00:48:20.905 --> 00:48:22.541
and we know that Presidents
of the United States

00:48:22.541 --> 00:48:24.741
are supposed to be respectable
politicians that are

00:48:24.741 --> 00:48:27.045
[laughing]

00:48:27.045 --> 00:48:29.154
probably not supposed to be playing jokes

00:48:29.154 --> 00:48:31.304
on their compatriots in this way.

00:48:31.304 --> 00:48:32.713
We know that there's these people

00:48:32.713 --> 00:48:34.564
in the background that
are laughing and smiling,

00:48:34.564 --> 00:48:36.066
and we know that that means that they're

00:48:36.066 --> 00:48:37.912
understanding something about the scene.

00:48:37.912 --> 00:48:39.597
We have some understanding that they know

00:48:39.597 --> 00:48:41.575
that President Obama
is this respectable guy

00:48:41.575 --> 00:48:42.866
who's looking at this other guy.

00:48:42.866 --> 00:48:43.767
Like, this is crazy.

00:48:43.767 --> 00:48:45.830
There's so much going on in this image.

00:48:45.830 --> 00:48:48.167
And, our computer vision algorithms today

00:48:48.167 --> 00:48:51.108
are actually a long way
I think from this true,

00:48:51.108 --> 00:48:53.002
deep understanding of images.

00:48:53.002 --> 00:48:56.032
So I think that sort of
despite the massive progress

00:48:56.032 --> 00:48:58.777
in the field, we really
have a long way to go.

00:48:58.777 --> 00:49:01.385
To me, that's really
exciting as a researcher

00:49:01.385 --> 00:49:02.630
'cause I think that we'll have

00:49:02.630 --> 00:49:04.611
just a lot of really
exciting, cool problems

00:49:04.611 --> 00:49:06.694
to tackle moving forward.

00:49:07.913 --> 00:49:10.202
So I hope at this point I've
done a relatively good job

00:49:10.202 --> 00:49:13.054
to convince you that computer
vision is really interesting.

00:49:13.054 --> 00:49:14.208
It's really exciting.

00:49:14.208 --> 00:49:16.329
It can be very useful.

00:49:16.329 --> 00:49:18.315
It can go out and make
the world a better place

00:49:18.315 --> 00:49:20.043
in various ways.

00:49:20.043 --> 00:49:21.591
Computer vision could be applied

00:49:21.591 --> 00:49:24.559
in places like medical
diagnosis and self-driving cars

00:49:24.559 --> 00:49:28.134
and robotics and all
these different places.

00:49:28.134 --> 00:49:30.713
In addition to sort of tying
back to sort of this core

00:49:30.713 --> 00:49:33.120
idea of understanding human intelligence.

00:49:33.120 --> 00:49:34.849
So to me, I think that computer vision

00:49:34.849 --> 00:49:37.141
is this fantastically
amazing, interesting field,

00:49:37.141 --> 00:49:38.775
and I'm really glad that over the course

00:49:38.775 --> 00:49:40.475
of the quarter, we'll
get to really dive in

00:49:40.475 --> 00:49:42.337
and dig into all these different details

00:49:42.337 --> 00:49:46.234
about how these algorithms
are working these days.

00:49:46.234 --> 00:49:48.949
That's sort of my pitch
about computer vision

00:49:48.949 --> 00:49:50.673
and about the history of computer vision.

00:49:50.673 --> 00:49:52.283
I don't know if there's
any questions about this

00:49:52.283 --> 00:49:53.366
at this time.

00:49:55.707 --> 00:49:57.055
Okay.

00:49:57.055 --> 00:49:58.345
So then I want to talk a little bit more

00:49:58.345 --> 00:50:00.408
about the logistics of this class

00:50:00.408 --> 00:50:02.408
for the rest of the quarter.

00:50:02.408 --> 00:50:04.382
So you might ask who are we?

00:50:04.382 --> 00:50:06.904
So this class is taught by Fei-Fei Li

00:50:06.904 --> 00:50:11.271
who is a professor of computer
science here at Stanford

00:50:11.271 --> 00:50:14.516
who's my advisor and director
of the Stanford Vision Lab

00:50:14.516 --> 00:50:16.852
and also the Stanford AI Lab.

00:50:16.852 --> 00:50:20.081
The other two instructors
are me, Justin Johnson,

00:50:20.081 --> 00:50:22.519
and Serena Yeung who is
up here in the front.

00:50:22.519 --> 00:50:25.219
We're both PhD students
working under Fei-Fei

00:50:25.219 --> 00:50:27.379
on various computer vision problems.

00:50:27.379 --> 00:50:29.996
We have an amazing
teaching staff this year

00:50:29.996 --> 00:50:31.920
of 18 TAs so far.

00:50:31.920 --> 00:50:34.179
Many of whom are sitting
over here in the front.

00:50:34.179 --> 00:50:35.921
These guys are really the unsung heroes

00:50:35.921 --> 00:50:38.527
behind the scenes making
the course run smoothly,

00:50:38.527 --> 00:50:40.320
making sure everything happens well.

00:50:40.320 --> 00:50:42.365
So be nice to them.

00:50:42.365 --> 00:50:44.196
[laughing]

00:50:44.196 --> 00:50:47.153
I think I also should mention
this is the third time

00:50:47.153 --> 00:50:49.216
we've taught this course,
and it's the first time

00:50:49.216 --> 00:50:51.652
that Andrej Karpathy has
not been an instructor

00:50:51.652 --> 00:50:53.050
in this course.

00:50:53.050 --> 00:50:56.192
He was a very close friend of mine.

00:50:56.192 --> 00:50:57.093
He's still alive.

00:50:57.093 --> 00:50:58.353
He's okay, don't worry.

00:50:58.353 --> 00:50:59.612
[laughing]

00:50:59.612 --> 00:51:02.780
But, he graduated, so he's actually here

00:51:02.780 --> 00:51:05.724
I think hanging around
in the lecture hall.

00:51:05.724 --> 00:51:07.662
A lot of the development and
the history of this course

00:51:07.662 --> 00:51:09.570
is really due to him working on it

00:51:09.570 --> 00:51:11.617
with me over the last couple of years.

00:51:11.617 --> 00:51:15.398
So I think you should be aware of that.

00:51:15.398 --> 00:51:18.194
Also about logistics,
probably the best way

00:51:18.194 --> 00:51:20.904
for keeping in touch with the course staff

00:51:20.904 --> 00:51:22.209
is through Piazza.

00:51:22.209 --> 00:51:25.212
You should all go and sign up right now.

00:51:25.212 --> 00:51:27.597
Piazza is really our preferred
method of communication

00:51:27.597 --> 00:51:30.353
with the class with the teaching staff.

00:51:30.353 --> 00:51:32.621
If you have questions that you're afraid

00:51:32.621 --> 00:51:34.313
of being embarrassed about asking

00:51:34.313 --> 00:51:36.067
in front of your classmates, go ahead

00:51:36.067 --> 00:51:38.602
and ask anonymously, or even
post private questions

00:51:38.602 --> 00:51:40.572
directly to the teaching staff.

00:51:40.572 --> 00:51:42.269
So basically anything that you need

00:51:42.269 --> 00:51:44.452
should ideally go through Piazza.

00:51:44.452 --> 00:51:46.445
We also have a staff mailing list,

00:51:46.445 --> 00:51:48.422
but we ask that this is mostly

00:51:48.422 --> 00:51:51.302
for sort of personal, confidential things

00:51:51.302 --> 00:51:53.517
that you don't want going on Piazza,

00:51:53.517 --> 00:51:55.773
or if you have something
that's super confidential,

00:51:55.773 --> 00:51:58.365
super personal, then feel free

00:51:58.365 --> 00:52:02.125
to directly email me or
Fei-Fei or Serena about that.

00:52:02.125 --> 00:52:03.900
But, for the most part,
most of your communication

00:52:03.900 --> 00:52:06.096
with the staff should be through Piazza.

00:52:06.096 --> 00:52:08.660
We also have an optional
textbook this year.

00:52:08.660 --> 00:52:10.401
This is by no means required.

00:52:10.401 --> 00:52:12.616
You can go through the course
totally fine without it.

00:52:12.616 --> 00:52:14.372
Everything will be self-contained.

00:52:14.372 --> 00:52:17.770
This is sort of exciting
because it's maybe the first

00:52:17.770 --> 00:52:19.786
textbook about deep
learning that got published

00:52:19.786 --> 00:52:21.889
earlier this year by Ian Goodfellow,

00:52:21.889 --> 00:52:24.078
Yoshua Bengio, and Aaron Courville.

00:52:24.078 --> 00:52:26.684
I put the Amazon link here in the slides.

00:52:26.684 --> 00:52:28.197
You can get it if you want to,

00:52:28.197 --> 00:52:30.079
but also the whole content of the book

00:52:30.079 --> 00:52:31.807
is free online, so you
don't even have to buy it

00:52:31.807 --> 00:52:32.943
if you don't want to.

00:52:32.943 --> 00:52:34.261
So again, this is totally optional,

00:52:34.261 --> 00:52:35.778
but we'll probably be
posting some readings

00:52:35.778 --> 00:52:37.614
throughout the quarter
that give you an additional

00:52:37.614 --> 00:52:40.614
perspective on some of the material.

00:52:41.697 --> 00:52:43.259
So our philosophy about this class

00:52:43.259 --> 00:52:47.035
is that you should really
understand the deep mechanics

00:52:47.035 --> 00:52:48.794
of all of these algorithms.

00:52:48.794 --> 00:52:50.671
You should understand at a very deep level

00:52:50.671 --> 00:52:52.717
exactly how these algorithms are working

00:52:52.717 --> 00:52:54.295
like what exactly is going on when you're

00:52:54.295 --> 00:52:56.097
stitching together these neural networks,

00:52:56.097 --> 00:52:58.128
how do these architectural decisions

00:52:58.128 --> 00:53:00.144
influence how the network is trained

00:53:00.144 --> 00:53:02.314
and tested and whatnot and all that.

00:53:02.314 --> 00:53:05.211
And, throughout the course
through the assignments,

00:53:05.211 --> 00:53:07.163
you'll be implementing
your own convolutional

00:53:07.163 --> 00:53:08.757
neural networks from scratch in Python.

00:53:08.757 --> 00:53:11.560
You'll be implementing the
full forward and backward

00:53:11.560 --> 00:53:13.260
passes through these
things, and by the end,

00:53:13.260 --> 00:53:15.106
you'll have implemented a whole
convolutional neural network

00:53:15.106 --> 00:53:16.320
totally on your own.

00:53:16.320 --> 00:53:18.320
I think that's really cool.

00:53:18.320 --> 00:53:20.569
But, we're also kind of
practical, and we know

00:53:20.569 --> 00:53:23.520
that in most cases people
are not writing these things

00:53:23.520 --> 00:53:25.613
from scratch, so we also want to give you

00:53:25.613 --> 00:53:27.769
a good introduction to some
of the state of the art

00:53:27.769 --> 00:53:31.326
software tools that are used
in practice for these things.

00:53:31.326 --> 00:53:33.373
So we're going to talk about
some of the state of the art

00:53:33.373 --> 00:53:36.392
software packages like TensorFlow,
Torch, PyTorch,

00:53:36.392 --> 00:53:37.663
all these other things.

00:53:37.663 --> 00:53:39.890
And, I think you'll get some exposure

00:53:39.890 --> 00:53:42.636
to those on the homeworks
and definitely through

00:53:42.636 --> 00:53:44.528
the course project as well.

00:53:44.528 --> 00:53:46.303
Another note about this course

00:53:46.303 --> 00:53:47.820
is that it's very state of the art.

00:53:47.820 --> 00:53:49.122
I think it's super exciting.

00:53:49.122 --> 00:53:50.715
This is a very fast moving field.

00:53:50.715 --> 00:53:53.337
As you saw, even these plots
in the ImageNet challenge

00:53:53.337 --> 00:53:55.611
basically there's been a ton of progress

00:53:55.611 --> 00:53:58.840
since 2012, and like while
I've been in grad school,

00:53:58.840 --> 00:54:00.538
the whole field is sort
of transforming every year.

00:54:00.538 --> 00:54:03.749
And, that's super exciting
and super encouraging.

00:54:03.749 --> 00:54:07.177
But, what that means is that
there's probably content

00:54:07.177 --> 00:54:09.132
that we'll cover this
year that did not exist

00:54:09.132 --> 00:54:12.893
the last time that this
course was taught last year.

00:54:12.893 --> 00:54:14.417
I think that's super
exciting, and that's one

00:54:14.417 --> 00:54:16.629
of my favorite parts
about teaching this course

00:54:16.629 --> 00:54:18.826
is just roping in all
these new scientific,

00:54:18.826 --> 00:54:21.041
hot off the presses stuff and being able

00:54:21.041 --> 00:54:24.041
to present it to you guys.

00:54:24.041 --> 00:54:26.071
We're also sort of about fun.

00:54:26.071 --> 00:54:27.770
So we're going to talk
about some interesting

00:54:27.770 --> 00:54:30.453
maybe not so serious
topics as well this quarter

00:54:30.453 --> 00:54:33.122
including image captioning, which is pretty fun,

00:54:33.122 --> 00:54:35.349
where we can write
descriptions about images.

00:54:35.349 --> 00:54:37.177
But, we'll also cover some
of these more artistic things

00:54:37.177 --> 00:54:39.896
like DeepDream here on the left

00:54:39.896 --> 00:54:42.261
where we can use neural
networks to hallucinate

00:54:42.261 --> 00:54:44.277
these crazy, psychedelic images.

00:54:44.277 --> 00:54:45.975
And, by the end of the course, you'll know

00:54:45.975 --> 00:54:46.877
how that works.

00:54:46.877 --> 00:54:48.900
Or on the right, this
idea of style transfer

00:54:48.900 --> 00:54:50.628
where we can take an image and render it

00:54:50.628 --> 00:54:54.507
in the style of famous artists
like Picasso or Van Gogh

00:54:54.507 --> 00:54:55.340
or whatnot.

00:54:55.340 --> 00:54:56.654
And again, by the end of the quarter,

00:54:56.654 --> 00:54:59.654
you'll see how this stuff works.

00:54:59.654 --> 00:55:02.519
So the way the course works
is we're going to have

00:55:02.519 --> 00:55:03.794
three problem sets.

00:55:03.794 --> 00:55:07.039
The first problem set
will hopefully be out

00:55:07.039 --> 00:55:08.252
by the end of the week.

00:55:08.252 --> 00:55:10.706
We'll have an in-class,
written midterm exam.

00:55:10.706 --> 00:55:12.511
And, a large portion of your grade

00:55:12.511 --> 00:55:15.056
will be the final course
project where you'll work

00:55:15.056 --> 00:55:17.407
in teams of one to three and produce

00:55:17.407 --> 00:55:20.514
some amazing project that
will blow everyone's minds.

00:55:20.514 --> 00:55:23.871
We have a late policy, so
you have seven late days

00:55:23.871 --> 00:55:26.380
that you're free to allocate
among your different homeworks.

00:55:26.380 --> 00:55:29.549
These are meant to cover
things like minor illnesses

00:55:29.549 --> 00:55:34.204
or traveling or conferences
or anything like that.

00:55:34.204 --> 00:55:36.188
If you come to us at
the end of the quarter

00:55:36.188 --> 00:55:38.757
and say that, "I suddenly
have to give a presentation

00:55:38.757 --> 00:55:39.971
"at this conference."

00:55:39.971 --> 00:55:40.880
That's not going to be okay.

00:55:40.880 --> 00:55:42.624
That's what your late days are for.

00:55:42.624 --> 00:55:44.111
That being said, if you have some

00:55:44.111 --> 00:55:46.643
very extenuating circumstances,
then do feel free

00:55:46.643 --> 00:55:48.705
to email the course staff
if you have some extreme

00:55:48.705 --> 00:55:50.295
circumstances about that.

00:55:50.295 --> 00:55:52.404
Finally, I want to make a note

00:55:52.404 --> 00:55:54.177
about the collaboration policy.

00:55:54.177 --> 00:55:55.921
As Stanford students,
you should all be aware

00:55:55.921 --> 00:55:58.389
of the honor code that governs the way

00:55:58.389 --> 00:56:00.785
that you should be collaborating
and working together,

00:56:00.785 --> 00:56:03.609
and we take this very seriously.

00:56:03.609 --> 00:56:05.635
We encourage you to think very carefully

00:56:05.635 --> 00:56:07.620
about how you're
collaborating and making sure

00:56:07.620 --> 00:56:11.037
it's within the bounds of the honor code.

00:56:12.304 --> 00:56:14.378
So in terms of prerequisites,
I think the most important

00:56:14.378 --> 00:56:17.492
is probably a deep familiarity with Python

00:56:17.492 --> 00:56:20.081
because all of the programming assignments

00:56:20.081 --> 00:56:22.339
will be in Python.

00:56:22.339 --> 00:56:26.066
Some familiarity with C
or C++ would be useful.

00:56:26.066 --> 00:56:29.354
You will probably not
be writing any C or C++

00:56:29.354 --> 00:56:31.705
in this course, but as you're
browsing through the source

00:56:31.705 --> 00:56:33.676
code of these various software packages,

00:56:33.676 --> 00:56:35.922
being able to read C++ code at least

00:56:35.922 --> 00:56:39.879
is very useful for understanding
how these packages work.

00:56:39.879 --> 00:56:42.439
We also assume that you
know what calculus is,

00:56:42.439 --> 00:56:44.971
you know how to take derivatives,
all that sort of stuff.

00:56:44.971 --> 00:56:46.533
We assume some linear algebra.

00:56:46.533 --> 00:56:47.879
That you know what matrices are

00:56:47.879 --> 00:56:52.072
and how to multiply them
and stuff like that.

00:56:52.072 --> 00:56:53.660
We can't be teaching you how to take

00:56:53.660 --> 00:56:55.691
like derivatives and stuff.

00:56:55.691 --> 00:56:57.321
We also assume a little bit of knowledge

00:56:57.321 --> 00:56:59.821
coming in of computer
vision maybe at the level

00:56:59.821 --> 00:57:01.238
of CS131 or 231a.

00:57:02.367 --> 00:57:03.923
If you have taken those courses before,

00:57:03.923 --> 00:57:05.120
you'll be fine.

00:57:05.120 --> 00:57:07.347
If you haven't, I think
you'll be okay in this class,

00:57:07.347 --> 00:57:09.853
but you might have a tiny
bit of catching up to do.

00:57:09.853 --> 00:57:11.550
But, I think you'll probably be okay.

00:57:11.550 --> 00:57:13.704
Those are not super strict prerequisites.

00:57:13.704 --> 00:57:16.964
We also assume a little
bit of background knowledge

00:57:16.964 --> 00:57:20.540
about machine learning
maybe at the level of CS229.

00:57:20.540 --> 00:57:23.556
But again, I think really
important, key fundamental

00:57:23.556 --> 00:57:25.723
machine learning concepts
we'll reintroduce

00:57:25.723 --> 00:57:27.755
as they come up and become important.

00:57:27.755 --> 00:57:29.916
But, that being said, a
familiarity with these things

00:57:29.916 --> 00:57:32.416
will be helpful going forward.

00:57:34.774 --> 00:57:36.046
So we have a course website.

00:57:36.046 --> 00:57:36.950
Go check it out.

00:57:36.950 --> 00:57:38.303
There's a lot of information and links

00:57:38.303 --> 00:57:39.742
and syllabus and all that.

00:57:39.742 --> 00:57:43.656
I think that's all that I
really want to cover today.

00:57:43.656 --> 00:57:46.157
And, then later this week on Thursday,

00:57:46.157 --> 00:57:48.733
we'll really dive into our
first learning algorithm

00:57:48.733 --> 00:00:00.000
and start diving into the
details of these things.